Abstract

A wideband fading channel is considered with causal channel state information 
(CSI) at the transmitter and no receiver CSI. A simple orthogonal code with energy 
detection rule at the receiver (similar to [B]) is shown to achieve the capacity of 
this channel in the limit of large bandwidth. This code transmits energy only 
when the channel gain is large enough. In this limit, this capacity without any 
receiver CSI is the same as the capacity with full receiver CSI, a phenomenon also
true for dirty paper coding. For Rayleigh fading, this capacity (per unit time) is 
proportional to the logarithm of the bandwidth. Our coding scheme is motivated 
from the Gel'fand-Pinsker [2,3] coding and dirty paper coding [4]. Nonetheless, for 
our case, only causal CSI is required at the transmitter in contrast with dirty-paper 
coding and Gel'fand-Pinsker coding, where non-causal CSI is required. 

Then we consider a general discrete channel with i.i.d. states. Each input has an 
associated cost and a zero cost input "0" exists. The channel state is assumed to
be known at the transmitter in a causal manner. Capacity per unit cost is found for
this channel and a simple orthogonal code is shown to achieve this capacity. Later, 
a novel orthogonal coding scheme is proposed for the case of causal transmitter 
CSI and a condition for equivalence of capacity per unit cost for causal and non- 
causal transmitter CSI is derived. Finally, some connections are made to the case 
of non-causal transmitter CSI in [8].



1 Introduction 

We consider a wireless fading channel of a large bandwidth $W$. The input $x_k[i]$ of band
$k$ at time $i$ is related to the output $y_k[i]$ as:

$$y_k[i] = h_k[i]\,x_k[i] + n_k[i], \qquad 1 \le k \le W,\; i \in \{1, 2, 3, \ldots\} \tag{1}$$

where $n_k[i]$ is complex circularly symmetric white Gaussian noise of unit variance. Each
$n_k[i]$ is independent of all inputs, fading gains, and noise in other bands. The fading
gains $\{h_k[i]\}$ are complex Gaussian with variance 1 and are assumed i.i.d. over time and
frequency. The transmitter has an average power constraint at any time $i$:

$$\sum_{k=1}^{W} \mathbb{E}\left[|x_k[i]|^2\right] \le P$$

*These results were first mentioned briefly in Allerton Conference, October 2004 and later in LIDS
Student Conference, January 2005.

Note that the channel state at time $i$ is completely described by the $W$ channel gains
$\{h_k[i] : 1 \le k \le W\}$. We assume (for reasons discussed later) that at each time $i$, the
transmitter knows this state, i.e. all $W$ fading gains at that time, and the receiver has no
such knowledge. That is, we are assuming full transmitter CSI and no receiver CSI.

The case of causal transmitter CSI and no receiver CSI was studied by Shannon for
discrete channels [?]. A discrete channel having $|\mathcal{X}|$ possible inputs and $|\mathcal{S}|$ possible
states (varying in i.i.d. manner) can be converted to a discrete memoryless channel with the
same output alphabet but a larger input alphabet of size $|\mathcal{X}|^{|\mathcal{S}|}$. Capacity of the original
channel equals that of this memoryless channel, which is easier to analyze.
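The conversion described above can be sketched in a few lines. This is a minimal illustration, not the construction from the cited work: the function name `strategy_channel`, the nested-list layout `p_y_given_xs[s][x][y]`, and the toy numbers are all assumptions made here for the example. Each input of the equivalent memoryless channel is a mapping from states to inputs, stored as a tuple.

```python
from itertools import product

def strategy_channel(p_s, p_y_given_xs, num_inputs):
    """Build the equivalent DMC whose inputs are mappings u: S -> X.

    p_s[s] is the state probability and p_y_given_xs[s][x][y] the original
    channel law (hypothetical layout). Returns {u: [P(y | u) for each y]},
    where u is the tuple (u(0), ..., u(|S|-1)).
    """
    num_states = len(p_s)
    num_outputs = len(p_y_given_xs[0][0])
    dmc = {}
    for u in product(range(num_inputs), repeat=num_states):
        row = [0.0] * num_outputs
        for s, x in enumerate(u):
            for y in range(num_outputs):
                # Average the channel law over the i.i.d. state.
                row[y] += p_s[s] * p_y_given_xs[s][x][y]
        dmc[u] = row
    return dmc

# Toy example (made-up numbers): binary input, two equally likely states.
P_S = [0.5, 0.5]
P_Y_GIVEN_XS = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: noiseless, y = x
    [[0.5, 0.5], [0.5, 0.5]],  # state 1: output independent of x
]
DMC = strategy_channel(P_S, P_Y_GIVEN_XS, num_inputs=2)
```

With two inputs and two states the equivalent channel has $2^2 = 4$ inputs, matching the alphabet size $|\mathcal{X}|^{|\mathcal{S}|}$.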

Later, [2, 3] studied the following modification of this scenario. There the channel
state for the entire codeword is known to the transmitter before beginning its transmis- 
sion. Thus the CSI is available to the transmitter in a non-causal manner, whereas the 
receiver has no CSI at all. The optimal code in this case has a large number of candidate 
codewords for each message. The candidate which is suitable to the entire state-sequence 
spanning the code-length is used for transmission. More precisely, a candidate which 
is jointly typical with the state-sequence is used for transmission. This motivates our 
coding scheme for this wideband fading channel, where the codeword candidate which 
benefits the most from the state sequence is used for transmission. 

For the wideband fading channel above, the capacity without any receiver and transmitter
CSI can be achieved by an orthogonal coding scheme like Pulse-Position Modulation
or Frequency-Shift Keying. In the limit of large bandwidth, this capacity
without any CSI equals the capacity with full receiver CSI, which is $P \log_2 e$ bits per unit
time.

For the case of full CSI at both ends, the capacity is achieved by water-filling which 
transmits power only when the channel gain is large enough and this capacity was shown 
to be essentially $P \log_2 W$ bits per unit time for the Rayleigh fading case [7]. For the
intermediate case of only transmitter CSI, we wish to combine these two ideas of orthogo- 
nal coding and water-filling. We show that one can combine these two ideas without loss 
of optimality, that is, a code combining these two ideas is shown to achieve the capacity 
of this channel. This capacity with only transmitter CSI turns out to be essentially the 
same as the capacity ($\sim P \log_2 W$ bits per unit time) with both transmitter and receiver
having CSI. This is another example where receiver CSI (or lack of it) does not affect 
the wideband capacity. In fact, it turns out that this capacity can be achieved by the 
proposed code with only one bit of transmitter CSI for each channel gain without any 
receiver CSI. 

After noting that transmitter CSI can significantly (by a factor of $\ln W$) increase
the capacity of a wideband fading channel irrespective of receiver CSI, we address the 
assumption of having transmitter CSI without any receiver CSI. This may seem to be a 
peculiar assumption for a wireless system because the transmitter in a typical wireless 
system obtains its CSI through feedback from the receiver itself. Nonetheless, after 
feeding back CSI to the transmitter, the receiver may want to ignore the CSI for multiple 
reasons, especially since this does not hurt capacity.



• Ignoring CSI at the receiver may help in simplifying the decoding algorithm. The
structure of the proposed orthogonal code (for a receiver with no CSI) may simplify 
the decoder. 

• Another reason for ignoring receiver CSI comes from the fact that obtaining CSI 
at the receiver is intrinsically costly (e.g. in terms of energy spent in training for 
CSI). We see later (in section 2) that if receiver CSI is ignored, obtaining CSI for 
all channels is not necessary. CSI needs to be obtained only for a small fraction of 
channels which reduces the overall cost of obtaining CSI. This saving in the channel 
estimation cost (energy) can bring significant gains in this wideband system, where 
the available energy per degree of freedom is severely limited. 

• In addition to less frequent CSI estimation, ignoring receiver CSI allows for a coarse 
channel estimation. As discussed in section 2, the proposed orthogonal code for a 
receiver with no CSI requires only one bit of CSI per channel. Obtaining this single 
bit of CSI might be easier compared to estimating the exact channel gain. 

The next section describes the coding scheme and proves its achievable rate. Section 3
considers a general discrete channel with states. For the case of causal CSI, an achievable
rate for this channel is proved with an orthogonal code. This is later shown to equal its
capacity. In the last section, the case of causal transmitter CSI is used to interpret the
case of non-causal transmitter CSI [8].



2 Capacity achieving scheme 

Our coding scheme is a modification of a scheme like Frequency-Shift Keying or
Pulse-Position Modulation: here the transmitter only transmits if the fading gain
is large. The purpose here is to exploit the channel randomness instead of combating
it. We will split the total bandwidth $W$ into $K$ pieces, each of bandwidth $w = W/K$,
and these pieces will be used separately for communication. The available power $P$
is also equally divided among these pieces. Next, we illustrate our coding scheme for one
such piece and analyze its achievable rate $r$. The total achievable rate is the number
of pieces $K$ times the rate per piece $r$. We will use the notation $f(x) \sim g(x)$ to denote
$\lim_{x \to \infty} f(x)/g(x) = 1$.
The code for such a piece of bandwidth $w$ spans $T$ symbols in time. This code uses
each of the $T$ time indices to denote a message from the set $\{1, 2, \ldots, T\}$. Thus $\ln T$
information nats^1 are transmitted in time $T$ and hence the code rate is $\ln T/T$ nats per
unit time. Say a total of $\lambda$ energy units are available for this. When message $j$ is to be
transmitted, these $\lambda$ energy units will be transmitted only at time $j$. Moreover, these
entire $\lambda$ units of energy are transmitted on a single frequency band, say $f_j$ (see Figure
1). This is the first band where the channel gain for time $j$ is larger than a threshold
$\Phi = \ln w - \ln(2 \ln w)$.

$$f_j = \min\{k : |h_k[j]|^2 > \Phi\} \tag{2}$$

Note that causal transmitter CSI is enough for this purpose. A type I error is declared 
if the channel gains at the time of message j do not cross this threshold for any band. 

^1 $\ln 2$ nats = 1 bit. Hence $\ln T$ nats equals $\log_2 T$ bits. Unless mentioned otherwise, units of rate are
nats per unit time.



The decoder calculates the average (over $w$ bands) received energy $E_i$ for each time
index $i$

$$E_i = \frac{1}{w}\sum_{k=1}^{w} |y_k[i]|^2, \qquad 1 \le i \le T$$



The time index for which Ei is maximum is declared as the transmitted message. Note 
that no channel state information is needed for this decoding method. 
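The encoder and decoder just described can be sketched as a small simulation. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`threshold`, `pick_band`, `decode`, `simulate`) are hypothetical, messages are indexed from 0 rather than 1, and a fresh set of gains is drawn at each time to mimic i.i.d. fading.

```python
import math
import random

def threshold(w):
    """Threshold Phi = ln w - ln(2 ln w) from the text (needs w > 1)."""
    return math.log(w) - math.log(2 * math.log(w))

def pick_band(gains_at_j, phi):
    """First band k with |h_k[j]|^2 > phi, or None (type I error)."""
    for k, h in enumerate(gains_at_j):
        if abs(h) ** 2 > phi:
            return k
    return None

def decode(received):
    """received[i][k] is the output at time i, band k.
    Returns the time index with the largest average received energy."""
    energies = [sum(abs(y) ** 2 for y in row) / len(row) for row in received]
    return max(range(len(energies)), key=energies.__getitem__)

def complex_gaussian(rng):
    """Unit-variance circularly symmetric complex Gaussian sample."""
    sigma = math.sqrt(0.5)
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))

def simulate(message, w, T, energy, rng):
    """Send `message` in {0, ..., T-1}; return the decoded index,
    or None if no band crosses the threshold at the message time."""
    phi = threshold(w)
    band = None
    received = []
    for i in range(T):
        gains = [complex_gaussian(rng) for _ in range(w)]
        x = [0.0] * w
        if i == message:
            # Causal CSI: only the gains of the current time are used.
            band = pick_band(gains, phi)
            if band is not None:
                x[band] = math.sqrt(energy)
        received.append([gains[k] * x[k] + complex_gaussian(rng)
                         for k in range(w)])
    return None if band is None else decode(received)
```

The decoder never looks at the gains, which reflects the absence of receiver CSI; the transmitter uses only one bit per gain (above or below the threshold).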



[Figure 1: Proposed coding scheme: colored symbol indicates energy transmitted. Axes: frequency and time. Annotation: message k corresponds to a pulse on the strongest frequency band at time k.]

Without loss of generality, we will assume that message 1 was transmitted. A type
II error is declared if $E_i$ is the largest for some time other than the time of message 1.
First, we show that probability $P_I$ of type I error goes to zero for large $w$. Note that
type I error occurs if and only if the channel gains of all $w$ bands at time 1 are smaller
than $\Phi$, that is, the maximum of those $w$ channel gains is smaller than $\Phi$. Since each
$|h_k[1]|^2$ is exponentially distributed, in the limit of large $w$, their maximum converges
in distribution to [?]:

$$\ln w + z, \quad \text{where the distribution of } z \text{ is } P(z \le Z) = \exp\left(-e^{-Z}\right) \tag{3}$$

With our choice of the threshold $\Phi$, probability of type I error is

$$P_I = P(z \le -\ln(2 \ln w)) = \exp(-2 \ln w) = 1/w^2 \tag{4}$$
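As a numerical sanity check on the type I error probability, one can compare the finite-$w$ probability that all $w$ independent unit-mean exponential gains fall below the threshold, $(1 - e^{-\Phi})^w$, with the limiting value $1/w^2$. The sketch below assumes exactly that i.i.d. exponential model; the function names are hypothetical.

```python
import math

def type_one_error_exact(w):
    """P(all w i.i.d. Exp(1) gains are <= Phi) = (1 - exp(-Phi))^w,
    with Phi = ln w - ln(2 ln w)."""
    phi = math.log(w) - math.log(2 * math.log(w))
    # log1p keeps precision when exp(-phi) is small.
    return math.exp(w * math.log1p(-math.exp(-phi)))

def type_one_error_limit(w):
    """Large-w value 1 / w^2 obtained from the limiting distribution."""
    return 1.0 / w ** 2
```

Since $e^{-\Phi} = 2 \ln w / w$, the finite-$w$ value is $(1 - 2\ln w/w)^w$, which never exceeds $1/w^2$ and approaches it as $w$ grows.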



Thus probability of type I error vanishes as $w$ tends to infinity. Now we show that
probability $P_{II}$ of type II error also vanishes as $w$ tends to infinity. Assuming that at
time 1, the channel gain of band $j$ crosses the threshold $\Phi$ (i.e. $f_1 = j$), the received
symbol in band $j$ at time 1 is

$$y_j[1] = h_j[1]\sqrt{\lambda} + n_j[1] \tag{5}$$
$$|y_j[1]|^2 = |h_j[1]|^2 \lambda + |n_j[1]|^2 + 2\sqrt{\lambda}\,\Re\{h_j[1]\, n_j[1]^*\} \tag{6}$$
$$\ge \Phi\lambda + |n_j[1]|^2 + 2\sqrt{\lambda}\,\Re\{h_j[1]\, n_j[1]^*\} \tag{7}$$

Using Eq. (3), for large $w$

$$P\left(|h_j[1]|^2 > \ln w + 2 \ln w\right) = 1 - \exp\left(-1/w^2\right) \approx 1/w^2 \tag{8}$$

Now assuming $|h_j[1]|^2 \le 3 \ln w$ yields

$$\Re\{h_j[1]\, n_j[1]^*\} \ge -|h_j[1]|\,|n_j[1]| \ge -\sqrt{3 \ln w}\,|n_j[1]| \tag{9}$$

Note that $|n_j[1]|^2$ is an exponential random variable with mean 1. Hence

$$P\left(|n_j[1]|^2 > 2 \ln w\right) = 1/w^2 \tag{10}$$

Now assuming $|n_j[1]|^2 \le 2 \ln w$ implies

$$\Re\{h_j[1]\, n_j[1]^*\} \ge -\sqrt{3 \ln w}\,\sqrt{2 \ln w}$$

Substituting in (7) implies $|y_j[1]|^2 \ge \Phi\lambda - 2\sqrt{6\lambda}\,\ln w + |n_j[1]|^2$.

The above statement may fail if any of the events in Eq. (4), (8), (10) occurs. By
union bound, this probability is at most $3/w^2$. Also note that the noise energy equals
the received energy in all other bands where no energy is transmitted. Hence with at
least a probability of $1 - 3/w^2$, the average received energy at time 1 follows:

$$E_1 \ge \frac{\Phi\lambda - 2\sqrt{6\lambda}\,\ln w}{w} + \frac{1}{w}\sum_{k=1}^{w} |n_k[1]|^2 = \alpha + \frac{1}{w}\sum_{k=1}^{w} |n_k[1]|^2 \tag{11}$$

where $\alpha$ equals the first term in (11) and is a non-random variable. By weak law of large
numbers, the second term in (11) converges to 1 for large $w$. Average received energy $E_t$
at any other time $t \ne 1$ equals

$$E_t = \frac{1}{w}\sum_{k=1}^{w} |n_k[t]|^2 \tag{12}$$

A type II error occurs when any of these $E_t$ exceeds $\alpha + 1$. Since each $|n_k[t]|^2$ is an
exponential random variable with mean 1, applying Chernoff's bound on similar lines of
[?] gives

$$P(E_t > \alpha + 1) \le \exp(-wL(\alpha)) \tag{13}$$
$$\text{where } L(\alpha) = \alpha - \ln(\alpha + 1) \tag{14}$$

Applying union bound over all wrong messages from 2 to $T$, we get the following bound
on type II error probability

$$P_{II} \le T\exp(-wL(\alpha)) = \exp\left(-w\left(L(\alpha) - \ln T/w\right)\right)$$

Since $P_I$ goes to zero with increasing $w$ as shown before, the overall error probability
vanishes with increasing $w$ if $P_{II}$ also vanishes with increasing $w$. This happens if

$$\ln T/w < L(\alpha) \tag{15}$$
$$\Leftrightarrow \ln T/T < \frac{w}{T}L(\alpha) = \frac{w}{T}\left(\alpha - \ln(\alpha + 1)\right) \tag{16}$$

Thus the maximum achievable rate^2 depends on $\alpha$ and hence depends on $\lambda$ (because $\alpha$
equals $(\Phi\lambda - 2\sqrt{6\lambda}\,\ln w)/w$).
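The Chernoff exponent, the union bound on the type II error, and the rate condition above are easy to evaluate numerically. The following is a small sketch with hypothetical function names; it evaluates the bound and does not simulate the channel.

```python
import math

def chernoff_exponent(alpha):
    """L(alpha) = alpha - ln(alpha + 1), the exponent in the Chernoff bound."""
    return alpha - math.log1p(alpha)

def type_two_error_bound(T, w, alpha):
    """Union bound T * exp(-w L(alpha)), clipped at 1."""
    return min(1.0, T * math.exp(-w * chernoff_exponent(alpha)))

def rate_is_supported(T, w, alpha):
    """Checks the condition ln T / w < L(alpha)."""
    return math.log(T) / w < chernoff_exponent(alpha)
```

For instance, with $T = 1000$, $w = 100$ and $\alpha = 1$ the condition holds and the bound is far below $10^{-9}$, while with $w = 10$ and $\alpha = 0.1$ the condition fails and the clipped bound is trivial.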

^2 We have shown that the error probability of this orthogonal code goes to zero as $w$ goes to infinity.
However, a subtle point is that for showing a rate $\ln T/T$ is achievable, we have to show that arbitrarily
small error probability can be achieved for a given (but large) $w$ and $T$. As shown in [?], this can
be achieved by coding over many blocks of our orthogonal code by treating the orthogonal code as
the inner code of this concatenated code. The orthogonal code provides an essentially noiseless discrete
memoryless channel (with input cardinality $T$) for the outer code. Thus a rate $\sim \ln T/T$ can be achieved.



Let this scheme be applied only for $\delta$ fraction of the time where $\delta$ is a suitably chosen
parameter. No communication happens in the remaining fraction of time. Thus if $p$ is
the overall average power available for this piece of bandwidth, $p/\delta$ is the average power
available when communication is being done. Thus the peakiness denoted by $\delta$ boosts
the power level for actual communication by a factor of $1/\delta$. This boost in power level
is necessary for the success of this orthogonal code. It ensures that the energy pulse
transmitted (for the correct message) is strong enough to be identifiable at the receiver
from incorrect messages.

Since the time-length of this code is $T$, total transmit energy $\lambda$ for this code is equal
to $Tp/\delta$. Since communication happens for only $\delta$ fraction of time, the overall maximum
achievable rate is given by

$$r = \delta\,\frac{w}{T}\,L(\alpha^*) \quad \text{where } \alpha^* = \frac{\Phi pT/\delta - 2\sqrt{6pT/\delta}\,\ln w}{w}$$

Since the total available power $P$ is divided equally amongst $K$ pieces of the total bandwidth,
power available per piece equals $p = P/K$. We choose $K = \ln w$ and $\delta = \epsilon T/w$,
where $\epsilon > 0$ is a small number. Substituting these values yields $\alpha^* \sim P/\epsilon$. Now note that
$L(\alpha) \sim \alpha$ for large $\alpha$. Since $\alpha^*$ can be made arbitrarily large by choosing small enough
$\epsilon$, the maximum achievable rate is given by

$$r \sim \delta\,\frac{w}{T}\,\alpha^* = \frac{\epsilon T}{w}\cdot\frac{w}{T}\cdot\frac{P}{\epsilon} = P$$

Since there are $K = \ln w$ such pieces of bandwidth $w$, the total rate $rK$ equals $P \ln w$
nats per unit time. The total bandwidth for these $K$ pieces equals $W = w \ln w$. Noting
that $\ln W \sim \ln w$, the total rate is given by $P \ln W$ nats per unit time.
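To see how the parameter choices play out at finite bandwidth, the expressions for $\alpha^*$ and the per-piece rate can be evaluated directly. The sketch below substitutes $p = P/\ln w$ and $\delta = \epsilon T/w$, so that $\lambda = Tp/\delta = Pw/(\epsilon \ln w)$ and $T$ cancels; the function names and the numerical values used afterwards are illustrative assumptions.

```python
import math

def alpha_star(P, eps, w):
    """alpha* = (Phi * lam - 2 sqrt(6 lam) ln w) / w, with
    lam = T p / delta = P w / (eps ln w)."""
    phi = math.log(w) - math.log(2 * math.log(w))
    lam = P * w / (eps * math.log(w))
    return (phi * lam - 2 * math.sqrt(6 * lam) * math.log(w)) / w

def rate_per_piece(P, eps, w):
    """r = delta (w / T) L(alpha*) = eps * L(alpha*), in nats per unit time."""
    a = alpha_star(P, eps, w)
    return eps * (a - math.log1p(a))
```

With $P = 1$ and $\epsilon = 0.1$, this evaluates to $\alpha^* \approx 7.5$ at $w = 10^6$ and $\alpha^* \approx 8.5$ at $w = 10^{12}$, against the limiting value $P/\epsilon = 10$. The gap closes only logarithmically in $w$ because $\Phi/\ln w = 1 - \ln(2\ln w)/\ln w$, so the asymptotic statements should be read as such.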

This rate expression matches the capacity of this fading channel when the receiver
and transmitter both have full CSI [7]. This proves that the proposed coding scheme
achieves the capacity for this channel with no receiver CSI. Thus the lack of receiver CSI
does not reduce capacity, a phenomenon similar to writing on dirty paper.

Theorem 1 Capacity of the Rayleigh fading wideband channel with causal transmitter
CSI and no receiver CSI is achieved by the proposed coding scheme. In the limit of large
bandwidth, this capacity $C \sim P \ln W$ nats per unit time and is unchanged even if the
receiver has full CSI.

Note that as mentioned in Section 1, full transmitter CSI is not needed for the proposed
scheme; only one bit of CSI is enough for each channel gain $h_k[j]$. This bit indicates
whether or not the channel gain is above the threshold $\Phi$. Also note that CSI is not needed
at every time for this scheme. Since there is no activity for $(1 - \delta)$ fraction of time and
only $\delta$ fraction of the time is used for communication, only this $\delta$ fraction of time needs
CSI and hence the cost of obtaining CSI is significantly reduced. Since the capacity of a
wideband channel with full receiver and transmitter CSI (at all times) is essentially the
same as the capacity of our channel with only transmitter CSI (for only a fraction $\delta$ of
time), one may want to mimic no receiver CSI even when it is available!

We can extend the above analysis to the case of noisy transmitter CSI, where the channel
gain $h_k[j]$ equals the sum of two independent complex Gaussian components, $g_k[j]$ and
$f_k[j]$, which are i.i.d. over frequency and time. The transmitter only knows $\{g_k[j]\}$ and the
error $f_k[j]$ is independent of $g_k[j]$. The variance of the known component is $\beta \in (0, 1]$ and
hence that of the error is $1 - \beta$. A code similar to the perfect CSI case is employed. For
example, if message 1 is to be transmitted, the transmitter transmits energy only in the
frequency band where the known channel strength $|g_k[1]|^2$ is larger than $\beta\Phi$. Thus the
threshold for the perfect transmitter CSI case is reduced by a factor of $\beta$. This scheme
can be shown to achieve a rate of $\beta P \ln W$ nats per unit time. This again equals the
capacity when the receiver also has full CSI [7]. Thus again receiver CSI is irrelevant for
capacity in the limit of large bandwidth.

Remark 1: Similar results can be proved when distribution of the fading gain $|h_k[j]|^2$
is not exactly exponential but has an exponential tail. If the tail behaves similar to an 
exponential with mean m, the capacity can be shown to be mPlnW nats per unit time. 

Remark 2: Similar analysis can be performed if the tail of the fading gain distribu- 
tion is a polynomial, that is, P(|hj[j]p > x) ~ for some n > 0. In that case, the 
proposed code achieves a rate R ~ PW"+^ nats per unit time. This again turns out to
be the same as the capacity when the receiver also has full CSI. 

Finally, (on similar lines of [?]) we can interpret the proposed scheme in terms of the
binning argument in [2, 3]. For the binning interpretation, logarithm of the number of
codewords per message should equal the mutual information between the state sequence 
{hj[i]} and the input sequence {xj[i]}. The number of possible codewords per message 
equals w in our code as energy can be transmitted on any of the w bands available in a 
piece. Now note that in our code, the state sequence completely determines the input 
sequence for a given message, because we transmit all energy only where the channel 
gain first crosses the threshold $. Hence the above mutual information equals the input 
entropy for a message. Since probability of no frequency band crossing the threshold 
goes to zero for large w and any of the w frequency bands are equally likely to cross 
the threshold, entropy of the input tends to logw. Thus the binning interpretation is 
justified as the logarithm of the number of possible codewords per message equals the 
mutual information between the input and state sequences. 

3 Capacity per cost with causal transmitter CSI 

We saw in the previous section how the proposed orthogonal code achieved the capacity 
of the wide-band fading channel with no receiver CSI. It also means that the proposed 
code achieved the capacity per unit cost for that channel. This section analyzes the case 
of causal transmitter CSI for a more general channel. 

The random variables at time $i \in \{1, 2, 3, \ldots\}$ corresponding to the channel input
$X_i$, output $Y_i$ and channel state $S_i$ take values from the sets $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{S}$ respectively^3.
State $S$ defines a channel transition matrix denoted by $P_{Y|XS}$. The states are assumed to
change i.i.d. over time, that is, if $P_S(\cdot)$ denotes the distribution of $S_i$ then the probability
of a state-sequence $s_1^l$ equals $\prod_{i=1}^{l} P_S(S_i = s_i)$. Conditioned on the state sequence, the
channel is assumed to be memoryless i.e.

$$P(Y_1^l \mid X_1^l S_1^l) = \prod_{i=1}^{l} P_{Y|XS}(y_i \mid x_i, s_i)$$
i=l 



^3 Unless stated otherwise, capital letters denote random variables and small letters denote their values.
Notation $X_1^i$ is used as a shorthand for the sequence $X_1 X_2 \cdots X_i$.



Each input $x \in \mathcal{X}$ incurs a cost $b(x) \in [0, \infty)$. A zero cost input is assumed to exist
and denoted by "0". In a code of length $l$, the codeword for message $j$ is denoted by
the sequence $x_1^l(j)$. A length $l$ code having $M \in \{1, 2, \ldots\}$ messages is denoted by a
$(l, M, \nu, \epsilon)$ code if the average probability of error is at most $\epsilon$ and codeword for every
message $j$ satisfies the total cost constraint

$$\sum_{i=1}^{l} b(x_i(j)) \le \nu, \qquad 0 \le j < M \tag{17}$$

The capacity per unit cost for this channel is defined as in [?].

Definition 2 For a given $0 < \epsilon < 1$, rate (in nats) per unit cost $R$ is said to be $\epsilon$-achievable
if for every $\gamma > 0$, there exists a $\nu_0$ such that for all $\nu \ge \nu_0$, a $(l, M, \nu, \epsilon)$
code can be found with $\ln M > \nu(R - \gamma)$. Rate per unit cost of $R$ is said to be achievable
if $R$ is $\epsilon$-achievable for every $\epsilon > 0$. Capacity per unit cost is the maximum achievable
rate per unit cost.

We assume no receiver CSI and causal transmitter CSI, which means that the transmitter
gets to know $S_i$ at time $i$ before transmitting $X_i$. Let $U : \mathcal{S} \to \mathcal{X}$ denote a mapping
from states to inputs. This mapping $U$ is equivalent to a vector in $\mathcal{X}^{|\mathcal{S}|}$, where each
entry denotes the input mapped from the corresponding state. Let $P_{Y|U=u}(y)$ denote the
output distribution induced when mapping $U = u$ is chosen, that is,

$$P_{Y|U=u}(y) = \sum_{s \in \mathcal{S}} P_S(s)\,P_{Y|XS}(y \mid u(s), s) \tag{18}$$

where $u(s)$ denotes mapping of state $s$ under $u$. We next prove the following theorem.

Theorem 3 Capacity per unit cost with no receiver CSI and causal transmitter CSI is
given by^4

$$\sup_{u} \frac{D(P_{Y|U=u} \,\|\, P_{Y|U=0})}{\mathbb{E}[b(X) \mid U = u]}$$

where $D(P_{Y|U=u} \,\|\, P_{Y|U=0})$ denotes the relative entropy (in nats) between the output distributions
induced when mapping $u$ is chosen and when identically zero mapping is chosen.
$\mathbb{E}[b(X) \mid U = u]$ denotes the average cost incurred when mapping $u$ is chosen.

$$\mathbb{E}[b(X) \mid U = u] = \sum_{s \in \mathcal{S}} P_S(s)\,b(u(s))$$
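For small alphabets the expression in Theorem 3 can be evaluated by brute force over all mappings $u$. The sketch below is only an illustration under assumptions made here: the names `kl`, `output_dist` and `capacity_per_unit_cost`, the layout `chan[s][x][y]`, the convention that input index 0 is the zero cost input, and the toy channel are all hypothetical.

```python
import math
from itertools import product

def kl(p, q):
    """Relative entropy D(p || q) in nats, with 0 ln 0 = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def output_dist(u, p_s, chan):
    """P_{Y|U=u}(y) = sum_s P_S(s) P_{Y|XS}(y | u(s), s), as in Eq. (18)."""
    n_out = len(chan[0][0])
    return [sum(p_s[s] * chan[s][u[s]][y] for s in range(len(p_s)))
            for y in range(n_out)]

def capacity_per_unit_cost(p_s, chan, cost):
    """Maximize D(P_{Y|U=u} || P_{Y|U=0}) / E[b(X) | U=u] over mappings u."""
    zero = tuple(0 for _ in p_s)  # input 0 is assumed to be the zero cost input
    p_zero = output_dist(zero, p_s, chan)
    best, best_u = 0.0, zero
    for u in product(range(len(cost)), repeat=len(p_s)):
        avg_cost = sum(p_s[s] * cost[u[s]] for s in range(len(p_s)))
        if avg_cost == 0:
            continue  # zero cost mappings do not enter the ratio
        ratio = kl(output_dist(u, p_s, chan), p_zero) / avg_cost
        if ratio > best:
            best, best_u = ratio, u
    return best, best_u

# Toy channel (made-up numbers): state 0 is noiseless, state 1 is useless.
P_S = [0.5, 0.5]
CHAN = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
COST = [0.0, 1.0]
BEST, BEST_U = capacity_per_unit_cost(P_S, CHAN, COST)
```

In this toy example the best mapping sends the costly input only in the noiseless state, giving $\ln 3$ nats per unit cost. This mirrors the wideband scheme, which spends energy only when the channel gain is large.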

Proof: We first show an orthogonal coding scheme which achieves the above rate per
unit cost. We use the shorthand $f(n) \doteq g(n)$ to denote $\lim_{n \to \infty} \ln f(n)/\ln g(n) = 1$. Similarly,
$f(n) \,\dot{\le}\, g(n)$ and $f(n) \,\dot{\ge}\, g(n)$ are defined.

Choose a mapping $u : \mathcal{S} \to \mathcal{X}$. Our code of $M$ messages spans $Mn$ symbols. Each
message corresponds to a non-overlapping interval of length $n$, that is, message $j \in
[0, M - 1]$ corresponds to interval^5 $[jn + 1, jn + n]$. If message $j$ is to be transmitted,
"0" is transmitted at all times except interval $[jn + 1, jn + n]$. During each time $i \in
[jn + 1, jn + n]$, input $u(S_i)$ is transmitted. This requires only causal CSI at the encoder.

^4 We assume that relative entropy and mutual information are measured in nats i.e. with natural
logarithm.

^5 This means the set of integers from $jn + 1$ to $jn + n$.



Assuming message $j$ was transmitted, the output distribution at each time in interval
$[jn + 1, jn + n]$ is given by $P_{Y|U=u}$. Outputs in all other intervals are distributed as $P_{Y|U=0}$.
For each of the $M$ intervals of length $n$, the decoder finds the empirical output distribution
of that interval. Let $\hat{P}_Y$ denote this empirical distribution for interval $[kn + 1, kn + n]$. The
interval $k$ for which $D(\hat{P}_Y \,\|\, P_{Y|U=0})$ is larger than a threshold^6 $\Phi = D(P_{Y|U=u} \,\|\, P_{Y|U=0}) - \delta$
is declared as the transmitted message, where $\delta$ is a chosen small number. An error is
declared when none or multiple such intervals exist.
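The decoding rule just described can be sketched as follows. This is an illustrative sketch only: the function names, the representation of outputs as integers in a flat list, and 0-indexed intervals are assumptions made for the example.

```python
import math
from collections import Counter

def empirical(block, alphabet_size):
    """Empirical distribution of a block of output symbols 0..alphabet_size-1."""
    counts = Counter(block)
    return [counts.get(y, 0) / len(block) for y in range(alphabet_size)]

def kl(p, q):
    """Relative entropy D(p || q) in nats, with 0 ln 0 = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def decode_intervals(outputs, n, p_zero, phi):
    """Return the unique interval k whose empirical divergence from p_zero
    exceeds phi; return None if no interval or several intervals qualify."""
    num_intervals = len(outputs) // n
    hits = [k for k in range(num_intervals)
            if kl(empirical(outputs[k * n:(k + 1) * n], len(p_zero)), p_zero) > phi]
    return hits[0] if len(hits) == 1 else None
```

For example, with `p_zero = [0.75, 0.25]`, interval length 4 and threshold 0.3, the output sequence `[0,0,0,1, 1,1,1,0, 0,1,0,0]` decodes to interval 1, the only block whose empirical distribution is far from `p_zero`.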

The first kind of error occurs if the divergence $D(\hat{P}_Y \,\|\, P_{Y|U=0})$ for the correct interval does
not exceed the threshold $D(P_{Y|U=u} \,\|\, P_{Y|U=0}) - \delta$. By Sanov's theorem (e.g. [?]), this
probability goes to zero exponentially fast in $n$. Hence this probability of error of first
kind is smaller than $\epsilon/3$ for all $n > n_1$ for some $n_1 > 0$.

The second kind of error occurs if the divergence $D(\hat{P}_Y \,\|\, P_{Y|U=0})$ for a wrong interval
$k \ne j$ exceeds the threshold. Again applying Sanov's theorem implies

$$P\left(D(\hat{P}_Y \,\|\, P_{Y|U=0}) \ge \Phi\right) \doteq \exp(-n\Phi) \tag{19}$$

By union bound, the probability $P_{II}$ that any of the $M - 1$ wrong intervals crosses this
threshold is bounded by

$$P_{II} \le M\exp(-n\Phi)$$

If we choose $M = \exp(n(\Phi - \delta))$, probability $P_{II}$ also goes to zero exponentially as
$\exp(-n\delta)$. Thus the probability of error of second kind is smaller than $\epsilon/3$ for all $n > n_2$
for some $n_2 > 0$.

By i.i.d. nature of the states and the law of large numbers, total cost for each message
is smaller than $n(\mathbb{E}[b(X) \mid U = u] + \delta)$ with (at least) a probability of $1 - \epsilon/3$ if $n > n_3$ is
chosen for some $n_3 > 0$.

Thus even if an error of third kind is declared if the total cost of the codeword
exceeds the threshold, the total probability of any kind of error is less than $3(\epsilon/3)$ for all
$n > \max(n_1, n_2, n_3)$. Thus for any $\epsilon > 0$ and $\gamma > 0$, we can choose small enough $\delta$ such
that

$$\ln M = n(\Phi - \delta)$$
$$= n\left(D(P_{Y|U=u} \,\|\, P_{Y|U=0}) - 2\delta\right)$$
$$> n\left(\mathbb{E}[b(X) \mid U = u] + \delta\right)\left(\frac{D(P_{Y|U=u} \,\|\, P_{Y|U=0})}{\mathbb{E}[b(X) \mid U = u]} - \gamma\right)$$

and the probability of error is smaller than $\epsilon$ for $n > \max(n_1, n_2, n_3) = n_\epsilon$. Substituting
$\nu = n(\mathbb{E}[b(X) \mid U = u] + \delta)$ and $\nu_0 = n_\epsilon(\mathbb{E}[b(X) \mid U = u] + \delta)$ in the definition of the rate
per unit cost proves that the proposed orthogonal code achieves a rate per unit cost of
$D(P_{Y|U=u} \,\|\, P_{Y|U=0})/\mathbb{E}[b(X) \mid U = u]$.

Remark 3: Note the similarity of this scheme with the coding scheme in previous
section for the wideband fading channel. In particular, note that the probability of an
incorrect interval crossing the threshold $\Phi$ is given by $\exp(-n\Phi)$, similar to (13). For
these reasons, one can interpret the divergence $D(\hat{P}_Y \,\|\, P_{Y|U=0})$ for interval $i$ as the discrete
channel analogue of the average received energy $E_i$ for the wideband fading channel.

^6 This threshold is finite if the support of $P_{Y|U=u}$ is contained in that of $P_{Y|U=0}$. This threshold is
not finite if there exists an output (say $y$) which can only occur with a non-zero input. The decoding in
that case would be easy because only the correct interval can have the output $y$.



Proof of converse: We first note the following upper bound in [?] on capacity per
unit cost of a discrete memoryless channel with input $V$ and output $Z$

$$\sup_{v} \frac{D(P_{Z|V=v} \,\|\, P_{Z|V=0})}{c(v)} \tag{20}$$

where $P_{Z|V=v}$ denotes the output transition probability for input $v$, $c(v)$ denotes the cost
of input $v$ and $v = 0$ denotes the zero cost input.

Now recall Shannon's idea that this channel with causal transmitter CSI and i.i.d.
states can be thought of as a discrete memoryless channel (DMC) with the same output
alphabet but a larger input alphabet. The input alphabet $\mathcal{U}$ of that equivalent DMC
corresponds to a mapping from $\mathcal{S}$ to $\mathcal{X}$ and thus its cardinality equals $|\mathcal{X}|^{|\mathcal{S}|}$. An input $U$
of this DMC is equivalent to a vector in $\mathcal{X}^{|\mathcal{S}|}$ made up of contingent inputs (from $\mathcal{X}$) for
each state $s \in \mathcal{S}$. A code for the DMC can be converted to a code for the causal transmitter
CSI channel as follows. If the symbol $U_i$ was transmitted at time $i$ on the DMC, the
transmitter with causal CSI transmits input $U_i(S_i)$ at time $i$ after observing state $S_i$.

This DMC is a cascade of two memoryless parts. The first part chooses the state $S_i$
with distribution $P_S$ and picks the corresponding contingent input $U_i(S_i) \in \mathcal{X}$ from
the transmitter. The second part is similar to our original channel of interest, which emits
the output based on the state $S_i$ and the input $U_i(S_i)$ according to the distribution
$P_{Y|XS}(\cdot \mid U_i(S_i), S_i)$. The output distribution of this DMC conditioned on the input $u$ is
given by $P_{Y|U=u}$ in (18).

Finally, note that $\mathbb{E}[b(X) \mid U = u]$ denotes the (average) cost incurred due to choosing
the DMC input $u$. The converse follows by applying (20) after replacing $P_{Z|V=v}$ by $P_{Y|U=u}$,
$P_{Z|V=0}$ by $P_{Y|U=0}$ and $c(v)$ by $\mathbb{E}[b(X) \mid U = u]$. A more detailed converse is proved in the
appendix.

We could also prove the direct part of this theorem using above method of conversion 
to a DMC. However, the earlier detailed proof is expected to be more insightful in view 
of writing on fading paper. 

4 Discussion 

For the wideband fading channel, we noted that the capacity with causal CSIT^7 was
the same as that with non-causal CSIT. Equivalently, the capacity per unit cost was the
same with causal or non-causal CSIT. A similar phenomenon can be shown for the AWGN
channel with additive Gaussian interference known at the transmitter by a modification
of the scheme in [?]. We want to understand whether these are isolated examples (of
the equivalence of capacity per unit cost with causal and non-causal CSIT) or whether they are
special cases of a general class.

If with causal or non-causal CSIT, the capacities (for any given cost constraint) are 
the same for a channel; then it is easy to show that the capacities per unit cost would 
also be the same for that channel. This is because for a channel with a cost alphabet, 
capacity per unit cost is given by the slope of the capacity vs. cost curve at 0. 

A more interesting problem is to characterize the class of channels for which the capacity
per unit cost is the same with causal or non-causal CSIT, but the capacity vs. cost curves
are not the same for causal and non-causal CSIT. The above-mentioned wideband AWGN
channel and wideband fading channel with additive interference are two such channels.

^7 CSIT: Acronym for transmitter CSI.



4.1 Review of the non-causal transmitter CSI case 



We briefly summarize the coding scheme that achieves the capacity per unit cost with
non-causal CSIT [8]. This code of $M$ messages spans $Mqn$ symbols. Each message in
this orthogonal code corresponds to a separate interval of length $qn$. For transmitting a
message $j$, non-zero symbols can only be transmitted in the $j$'th interval of length $qn$.
This message interval of length $qn$ can be thought of as a set of $q$ subintervals, each of
length $n$.

A distribution of states $\tilde{P}_S(\cdot)$ is chosen beforehand. Out of these $q$ subintervals in
the interval for message $j$, the subinterval whose empirical distribution is like $\tilde{P}_S(\cdot)$ is
chosen. More precisely, the divergence of the empirical distribution of this subinterval
with respect to $\tilde{P}_S(\cdot)$ should be small enough. Since the actual distribution of states is $P_S$,
the probability of a subinterval having distribution like $\tilde{P}_S(\cdot)$ is essentially (in the $\doteq$ sense)
given by $\exp(-nD(\tilde{P}_S \,\|\, P_S))$. We can find such a subinterval with high probability if
the number of subintervals per message interval is

$$q \doteq \exp\left(nD(\tilde{P}_S \,\|\, P_S)\right) \tag{21}$$

Non-zero symbols are only transmitted in this subinterval. A mapping $u : \mathcal{S} \to \mathcal{X}$ is also
chosen beforehand. Similar to the previous section, input $u(s)$ is transmitted for state $s$ in
this subinterval. The output distribution in this subinterval would be

$$\tilde{P}_Y(y) = \sum_{s \in \mathcal{S}} \tilde{P}_S(s)\,P_{Y|XS}(y \mid u(s), s) \tag{22}$$

Output distribution in all other subintervals (where only input 0 is transmitted) is $P_{Y|U=0}$

$$P_{Y|U=0}(y) = \sum_{s \in \mathcal{S}} P_S(s)\,P_{Y|XS}(y \mid 0, s)$$

Note that non-zero symbols are transmitted in a small fraction $(1/q)$ of the interval
corresponding to message $j$. Note from (21) that this fraction decays exponentially to
0 with increasing $n$. Also note that non-causal CSIT is necessary to determine the
subinterval having empirical distribution like $\tilde{P}_S(\cdot)$.

At the receiver, empirical distribution is found for all the $q$ subintervals for each
of the $M$ message intervals. If one of these $Mq$ subintervals has distribution like $\tilde{P}_Y$
in (22), the message interval containing that subinterval is declared as the transmitted
message. An error is declared otherwise. Since every wrong subinterval is distributed
as $P_{Y|U=0}$, probability of its having an empirical distribution as $\tilde{P}_Y(y)$ is essentially
$\exp(-nD(\tilde{P}_Y \,\|\, P_{Y|U=0}))$. Thus by union bound, the probability of a wrong subinterval
having output distribution $\tilde{P}_Y(y)$ is

$$Mq\exp\left(-nD(\tilde{P}_Y \,\|\, P_{Y|U=0})\right) = M\exp\left(nD(\tilde{P}_S \,\|\, P_S) - nD(\tilde{P}_Y \,\|\, P_{Y|U=0})\right)$$

Choosing $M \doteq \exp\left(n\left(D(\tilde{P}_Y \,\|\, P_{Y|U=0}) - D(\tilde{P}_S \,\|\, P_S)\right)\right)$ can ensure that probability of error
vanishes with large $n$. By law of large numbers, the total cost incurred for transmission
is essentially $n\,\mathbb{E}_{\tilde{P}_S}[b(u(S))]$ where

$$\mathbb{E}_{\tilde{P}_S}[b(u(S))] = \sum_{s \in \mathcal{S}} \tilde{P}_S(s)\,b(u(s))$$
ses 



Thus the rate per unit cost achieved by this scheme is

$$\frac{\ln M}{n\,\mathbb{E}_{\tilde{P}_S}[b(u(S))]} \doteq \frac{D(\tilde{P}_Y \,\|\, P_{Y|U=0}) - D(\tilde{P}_S \,\|\, P_S)}{\mathbb{E}_{\tilde{P}_S}[b(u(S))]} \tag{23}$$

Optimizing the above expression over the choice of $\tilde{P}_S$ and $u(\cdot)$ can be shown to yield the
capacity per unit cost for this channel with non-causal CSIT. We denote an optimum
choice by $\tilde{P}_S^*$ and $u^*(\cdot)$, respectively.
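The rate per unit cost of the non-causal scheme can be evaluated for a given pair $(\tilde{P}_S, u)$ as sketched below. The function names, the data layout `chan[s][x][y]`, the convention that input index 0 is the zero cost input, and the toy channel are assumptions made for this illustration; the sketch evaluates the expression for one choice and does not perform the optimization.

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) in nats, with 0 ln 0 = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def mix(weights, chan, u):
    """Output distribution sum_s weights[s] * P_{Y|XS}(y | u(s), s)."""
    n_out = len(chan[0][0])
    return [sum(weights[s] * chan[s][u[s]][y] for s in range(len(weights)))
            for y in range(n_out)]

def noncausal_rate_per_cost(p_s_tilde, p_s, chan, cost, u):
    """(D(P~_Y || P_{Y|U=0}) - D(P~_S || P_S)) / E_{P~_S}[b(u(S))]."""
    zero = tuple(0 for _ in p_s)         # input 0 assumed to have zero cost
    p_y_tilde = mix(p_s_tilde, chan, u)  # output law in the chosen subinterval
    p_y_zero = mix(p_s, chan, zero)      # output law everywhere else
    avg_cost = sum(p_s_tilde[s] * cost[u[s]] for s in range(len(p_s)))
    return (kl(p_y_tilde, p_y_zero) - kl(p_s_tilde, p_s)) / avg_cost

# Toy channel (made-up numbers): state 0 is noiseless, state 1 is useless.
P_S = [0.5, 0.5]
CHAN = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
COST = [0.0, 1.0]
```

When $\tilde{P}_S = P_S$ the penalty term $D(\tilde{P}_S \,\|\, P_S)$ vanishes and the expression reduces to the causal ratio for the same mapping $u$.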



4.2 Adapting to the causal transmitter CSI case 

With causal CSIT, the transmitter does not know a priori which subinterval has empirical
state distribution $\hat{P}_S$. To overcome this issue, let there be only one subinterval per
interval, i.e. let $q = 1$. Thus each message corresponds to an interval of length $n$. Now a
fraction $\theta$ is chosen by the transmitter. The transmitter can only transmit energy (non-zero
symbols) in a fraction $\theta$ of the message interval. For each state $s \in \mathcal{S}$, the transmitter
will transmit input $u(s)$ for the first $n\theta\hat{P}_S(s)$ occurrences of state $s$. Thus the states
where energy is transmitted will have an empirical distribution $\hat{P}_S$. Since the actual
state distribution is $P_S$, (by the law of large numbers) an interval of length $n$ will have
$n\theta\hat{P}_S(s)$ occurrences of state $s$ only if

$$\theta \hat{P}_S(s) \le P_S(s) \quad \forall s \in \mathcal{S} \iff \theta \le \inf_{s} \left\{ P_S(s)/\hat{P}_S(s) \right\} \qquad (24)$$

Note that the $n\theta$ symbols where energy is transmitted need not be in a contiguous block.
Again by the law of large numbers, the total cost incurred in this procedure is (in a
probabilistic sense) essentially $n\theta\,\mathbb{E}_{\hat{P}_S}[b(u(S))]$.
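The causal transmission rule described above can be sketched in a few lines. This is only an illustration under assumed values: the state distributions, the mapping `MAPPING`, and the deterministic state sequence are hypothetical, and the function names are not from the paper.

```python
import math

def max_fraction(p_s, p_s_hat):
    """Largest theta allowed by (24): min of P_S(s)/P_S_hat(s) over the support."""
    return min(p / ph for p, ph in zip(p_s, p_s_hat) if ph > 0)

def causal_inputs(states, p_s_hat, theta, mapping):
    """Send mapping[s] on the first n*theta*P_S_hat(s) occurrences of s, else 0."""
    n = len(states)
    # Small tolerance guards against floating-point round-off just below an integer.
    budget = [math.floor(n * theta * ph + 1e-9) for ph in p_s_hat]
    inputs = []
    for s in states:  # only the current state is inspected, so the rule is causal
        if budget[s] > 0:
            budget[s] -= 1
            inputs.append(mapping[s])
        else:
            inputs.append(0)
    return inputs

P_S = [0.8, 0.2]       # true state distribution (assumed)
P_S_HAT = [0.5, 0.5]   # target distribution on energy-carrying symbols (assumed)
MAPPING = [1, 2]       # hypothetical non-zero input u(s) for each state

theta = max_fraction(P_S, P_S_HAT)
states = [0, 0, 0, 0, 1] * 200  # a typical sequence for P_S with n = 1000
inputs = causal_inputs(states, P_S_HAT, theta, MAPPING)
print(theta, sum(x != 0 for x in inputs))
```

The energy-carrying positions are scattered through the interval, which is why the decoder has to examine subsequences rather than contiguous subintervals.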

At the decoder, for each message interval of length $n$, the empirical output distribution
is found for all $\binom{n}{n\theta}$ subsequences† of length $n\theta$. Out of these $M\binom{n}{n\theta}$ subsequences,
if all the subsequences having a distribution like $P_Y$ in (22) belong to a single message
interval, the message corresponding to that message interval is declared as the transmitted
message. An error is declared if more than one or none of the message intervals have such
subsequences. By the law of large numbers, the correct subsequence of length $n\theta$ where energy
is transmitted will have an empirical output distribution like $P_Y$ with high probability for
large $n$. A length-$n\theta$ subsequence in an incorrect message interval will have an empirical
output distribution like $P_Y$ with probability $p_1$ given by



$$p_1 = \exp\left(-\theta n D(P_Y \| P_{Y|U=0})\right)$$



Applying the union bound, the probability of a subsequence of an incorrect message interval
having an empirical output distribution like $P_Y$ is bounded by $M\binom{n}{n\theta}p_1$. Hence vanishing
error probability can be achieved if



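$$\begin{aligned}
M &= \frac{1}{\binom{n}{n\theta}\, p_1} = \frac{\exp\left(\theta n D(P_Y \| P_{Y|U=0})\right)}{\binom{n}{n\theta}} \\
&\approx \frac{\exp\left(\theta n D(P_Y \| P_{Y|U=0})\right)}{\exp\left(n H_b(\theta)\right)} \qquad \text{by Stirling's approximation} \\
&= \exp\left(n\left(\theta D(P_Y \| P_{Y|U=0}) - H_b(\theta)\right)\right)
\end{aligned}$$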
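The Stirling step, which replaces the binomial coefficient by $\exp(nH_b(\theta))$, can be checked numerically. The sketch below compares the normalized log-binomial with the binary entropy in nats; the helper names and the choice $\theta = 1/4$ are illustrative assumptions.

```python
import math

def binary_entropy(theta):
    """H_b(theta) in nats."""
    if theta in (0.0, 1.0):
        return 0.0
    return -theta * math.log(theta) - (1 - theta) * math.log(1 - theta)

def log_binom(n, k):
    """ln C(n, k) computed with log-gamma to avoid huge integers."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

for n in (100, 1000, 10000):
    k = n // 4  # theta = 1/4
    print(n, log_binom(n, k) / n, binary_entropy(k / n))
```

The gap between the two printed columns shrinks roughly like $(\ln n)/n$, so it does not affect the exponent in the expression for $M$.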



†The term subinterval is reserved for contiguous blocks of symbols, whereas a subsequence need not
be contiguous.



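The rate per unit cost achieved by this scheme equals

$$\frac{\ln M}{n\theta\,\mathbb{E}_{\hat{P}_S}[b(u(S))]} = \frac{\theta D(P_Y \| P_{Y|U=0}) - H_b(\theta)}{\theta\,\mathbb{E}_{\hat{P}_S}[b(u(S))]} = \frac{D(P_Y \| P_{Y|U=0}) - H_b(\theta)/\theta}{\mathbb{E}_{\hat{P}_S}[b(u(S))]}$$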

For this to equal the capacity per unit cost with non-causal CSIT in (23), an optimum
$\theta$ (the largest value allowed by (24)) should satisfy

$$H_b(\theta)/\theta = D(\hat{P}_S^* \| P_S)$$

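Now note that (24) implies

$$\begin{aligned}
D(\hat{P}_S^* \| P_S) &= \sum_{s \in \mathcal{S}} \hat{P}_S^*(s) \ln \frac{\hat{P}_S^*(s)}{P_S(s)} \\
&\le \ln \frac{1}{\theta} \\
&\le \frac{H_b(\theta)}{\theta}
\end{aligned}$$

The last step is met with equality either when $\theta = 1$ or (in the sense that the ratio of
the two sides tends to 1) when $\theta$ tends to zero. For equality in the second step, we need
$\hat{P}_S^*(s)/P_S(s) = 1/\theta$ for all states having $\hat{P}_S^*(s) > 0$.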
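The chain of inequalities above can be checked on small examples. The sketch below uses two hypothetical choices of the energy-carrying state distribution: one proportional to $P_S$ on its support (so the second step holds with equality) and one that is not. All numbers and helper names are assumptions for illustration.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def binary_entropy(theta):
    """H_b(theta) in nats."""
    if theta in (0.0, 1.0):
        return 0.0
    return -theta * math.log(theta) - (1 - theta) * math.log(1 - theta)

def chain(p_s_hat, p_s):
    """Return (D(P_S_hat || P_S), ln(1/theta), H_b(theta)/theta) for the largest theta."""
    theta = min(p / ph for p, ph in zip(p_s, p_s_hat) if ph > 0)
    return kl(p_s_hat, p_s), math.log(1 / theta), binary_entropy(theta) / theta

P_S = [0.5, 0.3, 0.2]                       # hypothetical true state distribution
proportional = chain([0.0, 0.6, 0.4], P_S)  # P_S_hat / P_S = 2 on the support
generic = chain([0.2, 0.4, 0.4], P_S)       # ratio varies across states

print(proportional)
print(generic)
```

In both cases the largest admissible fraction is 1/2, so the remaining gap to $H_b(\theta)/\theta$ illustrates why the condition asks for $\theta$ equal to 1 or tending to zero.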

The case $\theta = 1$ corresponds to $\hat{P}_S^* = P_S$. By the law of large numbers, the empirical
distribution for each interval would be $P_S$ with high probability. Thus the non-causal
nature of transmitter CSI is rendered useless in this case because only 1 subinterval (i.e.
$q = 1$) suffices per message interval.

Thus this coding scheme gives the following sufficient condition for the capacity per 
unit cost with causal or non-causal CSIT to be the same. 

Theorem 4 Let $\mu$ denote $\inf_{s} \{P_S(s)/\hat{P}_S^*(s)\}$ for an optimum $\hat{P}_S^*$ that achieves the capacity
per unit cost for non-causal CSIT in (23). For equivalence of capacity per unit cost
with causal and non-causal CSIT, $\mu$ should either be arbitrarily small or be equal to 1.
Moreover, all states in the support of $\hat{P}_S^*$ (i.e. states having $\hat{P}_S^*(s) > 0$) should achieve‡
the infimum $\mu = P_S(s)/\hat{P}_S^*(s)$.

If $\mu$ tends to zero, the divergence $D(\hat{P}_S^* \| P_S)$ should tend to infinity to satisfy the above
condition. In other words, it is arbitrarily rare to observe the state distribution where
energy is transmitted. This is because (by Sanov's theorem) the larger $D(\hat{P}_S^* \| P_S)$ is, the
rarer it is to have an empirical distribution like $\hat{P}_S^*$ when the actual state distribution is $P_S$.

Note that the wideband fading channel and the wideband writing on dirty paper [4]
satisfy the conditions of Theorem 4, which guarantees that the capacity per unit cost is the
same with causal or non-causal CSIT. The fraction of states $\theta$ where energy was transmitted was
arbitrarily small there. Thus Theorem 4 explains some reasons for the equivalence
of the capacity per unit cost with causal and non-causal CSI for those channels.

With this background, we revisit the capacity achieving scheme for the non-causal
CSIT case. The state vector of each length-$n$ subinterval can be viewed as a superstate
$(S_1, \ldots, S_n)$ of cardinality $|\mathcal{S}|^n$. Now each message in the code corresponds to an interval
consisting of $q$ super-symbols (or subintervals).

‡If $\mu$ tends to 0, this clause can be relaxed as long as $H_b(\mu)/\mu$ approaches $D(\hat{P}_S^* \| P_S)$.

A subinterval of empirical distribution $\hat{P}_S$ corresponds to a superstate with probability
essentially $\exp(-nD(\hat{P}_S \| P_S))$. Energy is only transmitted in these rare subintervals.
Non-causal CSIT of a length-$n$ subinterval in the original channel corresponds to causal CSIT
in the super-channel. The idea of subintervals has thus converted the channel with non-causal
CSIT to a channel with causal CSIT.
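The exponential rarity of such superstates can be checked exactly for a binary state. The sketch below computes the probability that an i.i.d. state sequence has a prescribed empirical distribution and compares its exponent with the divergence; the distributions and the helper names are hypothetical choices for illustration.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_type_probability(counts, p_s):
    """ln P(an i.i.d. sequence drawn from p_s has exactly these state counts)."""
    n = sum(counts)
    log_multinomial = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    log_sequence = sum(c * math.log(p) for c, p in zip(counts, p_s) if c > 0)
    return log_multinomial + log_sequence

P_S = [0.8, 0.2]       # true state distribution (assumed)
P_S_HAT = [0.5, 0.5]   # empirical distribution of the rare subinterval (assumed)

for n in (20, 200, 2000):
    counts = [round(n * p) for p in P_S_HAT]
    exponent = -log_type_probability(counts, P_S) / n
    print(n, exponent, kl(P_S_HAT, P_S))
```

The printed exponent approaches the divergence from above as the subinterval length grows, in line with the "essentially" in the text.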

The causal CSIT channel (with superstates) has some arbitrarily rare superstates
where energy is transmitted for achieving capacity per unit cost. Hence by Theorem 4 the
capacity per unit cost for this super-channel is the same with causal or non-causal CSIT.
Since non-causal CSIT for the super-channel also means non-causal CSIT for the original
channel, the capacity per unit cost of the super-channel for causal CSIT equals the
capacity per unit cost of the original channel for non-causal CSIT. Thus even if Theorem 4
is not satisfied for the original channel directly, the idea of subintervals achieves the
non-causal capacity per unit cost by converting the original channel to a super-channel for
which Theorem 4 is satisfied. This is achieved by providing arbitrarily rare (super)states
for transmitting energy.

Acknowledgements 

Thanks to Robert Gallager for suggesting a simple On-Off fading channel, which prompted 
the writing on fading paper scheme. Shashi Borade also acknowledges numerous insightful
comments and suggestions by Ashish Khisti.

Appendix: Proof of Converse of Theorem El 

We use a technique similar to [OjlH], which adapts a converse for capacity to a converse 
for capacity per unit cost. For a code of length $l$, by Fano's inequality we know the
following necessary condition for transmitting a message $m$ chosen uniformly out of $M$
possible messages with error probability smaller than $\epsilon$:

$$(1 - \epsilon) \ln M \le I(m; Y^l) + H_b(\epsilon)$$

where $H_b(\cdot)$ denotes the entropy of a binary variable as a function of its probability of
being 1. From the converse for the capacity of a channel with causal transmitter CSI, we 
know that PP 

$$I(m; Y^l) \le \sum_{i=1}^{l} I(U_i; Y_i) \qquad (25)$$

where $I(U_i; Y_i)$ denotes the mutual information between the state to input mapping $U_i$
and the output $Y_i$. The mapping $U_i$ is considered as a random variable of cardinality
$|\mathcal{X}|^{|\mathcal{S}|}$ (the number of mappings from states to inputs). The mutual information can be
thought of as the mutual information of a channel with input $U_i$ and output $Y_i$, where the
output transition probability for input $u$ is given by $P_{Y|U=u}$ in (Hi).

Now we introduce a time-sharing random variable $Q$, which is independent of all other
variables and is uniformly distributed over the integers from 1 to $l$. This gives the following
upper bound:

$$\begin{aligned}
\sum_{i=1}^{l} I(U_i; Y_i) &= l\, I(U_Q; Y_Q \mid Q) \\
&= l \left( I(U_Q, Q; Y_Q) - I(Q; Y_Q) \right) \\
&\le l\, I(U_Q, Q; Y_Q)
\end{aligned}$$

Defining $U = (U_Q, Q)$ and $Y = Y_Q$, we get the upper bound on $(1 - \epsilon) \ln M$ as
$l\, I(U; Y) + H_b(\epsilon)$.

Now assume a weaker average cost constraint instead of the per-codeword cost constraint
in (17), as follows:

$$\sum_{i=1}^{l} \mathbb{E}[b(X_i)] \le \nu$$



Using the time-sharing variable and later replacing $X_Q$ by $X$ gives

$$\sum_{i=1}^{l} \mathbb{E}[b(X_i)] = \sum_{i=1}^{l} \mathbb{E}[b(X_Q) \mid Q = i] = l\, \mathbb{E}[b(X_Q)] = l\, \mathbb{E}[b(X)]$$



Combining this with the upper bound on $(1 - \epsilon) \ln M$ gives

$$\frac{\ln M}{\nu} \le \frac{I(U; Y) + H_b(\epsilon)/l}{(1 - \epsilon)\, \mathbb{E}[b(X)]}$$

As $\epsilon$ can be arbitrarily small and $l$ can be arbitrarily large, a code with arbitrarily
small error probability $\epsilon$ must satisfy

$$\frac{\ln M}{\nu} \le \frac{I(U; Y)}{\mathbb{E}[b(X)]}$$

for some choice of random variable $U$ (which denotes a mapping from states to inputs).
Now note that the mutual information $I(U; Y)$ can be written as



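$$\begin{aligned}
I(U; Y) &= \sum_{u} P_U(u)\, D(P_{Y|U=u} \| P_Y) && (26) \\
&= \sum_{u} P_U(u)\, D(P_{Y|U=u} \| P_{Y|U=0}) - D(P_Y \| P_{Y|U=0}) && (27) \\
&\le \sum_{u} P_U(u)\, D(P_{Y|U=u} \| P_{Y|U=0}) && (28)
\end{aligned}$$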



where $P_{Y|U=0}$ indicates the output distribution when the state to input mapping is
identically zero, i.e. when the zero input is transmitted for every state. Also note that the
expected cost can be written as

$$\mathbb{E}[b(X)] = \sum_{u} P_U(u)\, \mathbb{E}[b(X) \mid U = u]$$
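The decomposition of $I(U;Y)$ into an average divergence from $P_{Y|U=0}$ minus a correction term, as in (27), can be verified numerically. The sketch below uses a hypothetical distribution over three mappings with binary output; index 0 plays the role of the all-zero mapping, and all numbers are illustrative assumptions.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical values: P_U over three mappings and the rows P_{Y|U=u}.
p_u = [0.7, 0.2, 0.1]
rows = [[0.9, 0.1],   # u = 0: all-zero mapping, i.e. P_{Y|U=0}
        [0.5, 0.5],
        [0.2, 0.8]]

# Output marginal P_Y induced by P_U.
p_y = [sum(pu * row[y] for pu, row in zip(p_u, rows)) for y in range(2)]

mutual_info = sum(pu * kl(row, p_y) for pu, row in zip(p_u, rows))       # (26)
avg_divergence = sum(pu * kl(row, rows[0]) for pu, row in zip(p_u, rows))
correction = kl(p_y, rows[0])                                            # D(P_Y || P_{Y|U=0})

print(mutual_info, avg_divergence - correction)  # the two sides of (27)
```

Since the correction term is a divergence and hence non-negative, dropping it gives the upper bound (28).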

Combining the expected-cost expression with (28) gives that any code with arbitrarily small
error probability should satisfy

$$\frac{\ln M}{\nu} \le \frac{\sum_{u} P_U(u)\, D(P_{Y|U=u} \| P_{Y|U=0})}{\sum_{u} P_U(u)\, \mathbb{E}[b(X) \mid U = u]}$$

for some distribution $P_U(\cdot)$ of the state to input mapping $U$. In other words,



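$$\frac{\ln M}{\nu} \le \sup_{P_U} \frac{\sum_{u} P_U(u)\, D(P_{Y|U=u} \| P_{Y|U=0})}{\sum_{u} P_U(u)\, \mathbb{E}[b(X) \mid U = u]} = \sup_{u} \frac{D(P_{Y|U=u} \| P_{Y|U=0})}{\mathbb{E}[b(X) \mid U = u]}$$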



Q.E.D. 
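The final step of the proof relies on the fact that a ratio of weighted averages cannot exceed the largest individual ratio, with equality for a point mass on the best mapping. A minimal numeric sketch of this fact follows; the divergence and cost values are hypothetical and the function name is illustrative.

```python
def ratio_of_averages(weights, numerators, denominators):
    """(sum_u w_u * a_u) / (sum_u w_u * b_u) for positive denominators b_u."""
    num = sum(w * a for w, a in zip(weights, numerators))
    den = sum(w * b for w, b in zip(weights, denominators))
    return num / den

# Hypothetical per-mapping values (non-zero mappings only, so all costs are positive).
divergences = [0.30, 1.20, 0.90]   # stand-ins for D(P_{Y|U=u} || P_{Y|U=0})
costs = [1.0, 2.0, 3.0]            # stand-ins for E[b(X) | U = u]

best_single = max(d / c for d, c in zip(divergences, costs))
mixed = ratio_of_averages([0.2, 0.5, 0.3], divergences, costs)
point_mass = ratio_of_averages([0.0, 1.0, 0.0], divergences, costs)

print(best_single, mixed, point_mass)
```

This is why the supremum over distributions $P_U$ reduces to a supremum over single mappings $u$, giving an orthogonal-code style expression for the capacity per unit cost.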